CRESST REPORT 788 



David Silver 
Mark Hansen 
Joan Herman 
Yael Silk 
Cynthia L. Greenleaf 



IES INTEGRATED LEARNING 
ASSESSMENT FINAL REPORT 

MARCH, 2011 




The National Center for Research on Evaluation, Standards, and Student Testing 



Graduate School of Education & Information Studies 
UCLA | University of California, Los Angeles 






National Center for Research on Evaluation, 
Standards, and Student Testing (CRESST) 
Center for the Study of Evaluation (CSE) 
Graduate School of Education & Information Studies 
University of California, Los Angeles 
300 Charles E. Young Drive North 
GSE&IS Bldg., Box 951522 
Los Angeles, CA 90095-1522 
(310) 206-1532 




Copyright © 2011 The Regents of the University of California. 

The work reported herein was supported under a subcontract (S05-059) to the National Center for Research on 
Evaluation, Standards, and Student Testing (CRESST) funded by the U.S. Department of Education, grant number 
R305M050031, and administered by the WestEd Agency. 

The findings and opinions expressed in this report are those of the authors and do not necessarily reflect the 
positions or policies of the U.S. Department of Education or the WestEd Agency. 

To cite from this report, please use the following as your APA reference: Silver, D., Hansen, M., Herman, J., 
Silk, Y., & Greenleaf, C. L. (2011). IES integrated learning assessment final report (CRESST Report 788). Los 
Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student 
Testing (CRESST). 




TABLE OF CONTENTS 

Abstract 
Introduction 
  Project Background 
  Why Develop the ILA? 
  Rationale for the ILA Format 
Development of the Biology ILA 
  Content Focus Selection 
  Test Specification 
  Text Selection 
  Item Generation 
Development of the History ILA 
  Content Focus Selection 
  Test Specification 
  Text Selection 
  Item Generation 
Structure of the ILA 
  Overview 
  Part 1: Content Knowledge 
  Part 2: Reading Comprehension, Metacognition, and Reading Strategies 
  Part 3: Writing Assessment 
Methods 
  Sample 
  The Scoring Session 
  Reliability of ILA Scores 
  Theoretical and Statistical Models of RA Effects 
Results 
  Estimated Treatment Effects and Correlations Among Study Variables 
Conclusion 
References 
Appendix A: Biology ILA (Parts 1, 2, 3) 
Appendix B: History ILA (Parts 1, 2, 3) 
Appendix C: ILA Teacher Feedback Survey 
Appendix D: Metacognition Scoring Rubric 
Appendix E: Reading Strategies Scoring Rubric 
Appendix F: Writing Content Scoring Rubric 
Appendix G: Writing Language Scoring Rubric 
Appendix H: Writing Language Rubric Description 
Appendix I: IRT-Based Analysis of Multiple Choice Tests of Content Knowledge 




IES INTEGRATED LEARNING ASSESSMENT FINAL REPORT 



David Silver, Mark Hansen, Joan Herman, Yael Silk 
CRESST/University of California, Los Angeles 

Cynthia L. Greenleaf 
WestEd 

Abstract 

The main purpose of this study was to examine the effects of the Reading 
Apprenticeship professional development program on several teacher and student 
outcomes, including effects on student learning. A key part of the study was the use of 
an enhanced performance assessment program, the Integrated Learning Assessment 
(ILA), to measure student content understanding. The ILA instruments included 
multiple components that assessed student content knowledge, reading comprehension, 
metacognition, use of reading strategies, and writing skills in applied knowledge. An 
analysis of student ILA scores found few, if any, significant effects of the 
Reading Apprenticeship program on class-level student outcomes. However, the 
researchers found a significant positive effect on teachers’ literacy instruction. 

Introduction 



Project Background 

The major aim of this study was to examine the effects of assignment to the Reading 
Apprenticeship (RA) professional development program on several teacher and student 
outcome variables. In other words, the study sought to evaluate the effects of the RA 
professional development program on teacher practices and student learning. High school 
biology and history teachers were recruited for the study, and the effects on students’ literacy 
within the subject area were of particular interest. 

Large scale assessments, such as the California Standards Test (CST) and Arizona’s 
Instrument to Measure Standards (AIMS), may honor breadth over depth of student 
knowledge and comprehension; thus, we developed a supplementary, more detailed measure 
in order to examine the potential effects of the RA on student learning. This new 
performance-based assessment is called the Integrated Learning Assessment (ILA). The ILA 
integrates adaptations of CRESST’s model-based assessments for measuring content 
understanding (Baker, Freeman, & Clayton, 1991; Chung, Harmon, & Baker, 2001; 
Herman, Baker, & Linn, 2004; Baker, Aschbacher, Niemi, & Sato, 1992) with WestEd’s 
Strategic Literacy Initiative’s Curriculum Embedded Reading Assessment (CERA). As a 
starting point, we defined acquisition of conceptual understanding in biology and history to 
include the mastery of particular concepts and ideas through engagement with texts as well 
as the ability to effectively integrate concepts into the formulation of explanations. The 
purpose of this report was to provide information about the development process and the 
preliminary results of the ILA, including the reliability and distribution of scores for our 
study. 

Why Develop the ILA? 

The ability to read and write in a discipline-specific context is increasingly recognized 
as a critical skill for high performance in an academic setting. Despite widespread 
acknowledgement that disciplines such as science and history demand multi-faceted literacy 
skills, few validated instruments have been developed that evaluate the impact of 
discipline-specific literacy instruction on student outcomes. The ILA was designed to evaluate both 
discipline-specific content knowledge and literacy skills integral to the successful access 
and display of content knowledge. As an evaluation tool for skills and knowledge, the ILA 
was designed to measure how students use RA-guided interactive reading skills, as well as 
how these skills influence student achievement. Specifically, the ILA was developed as a 
measure to examine the extent to which students utilize cognitive and metacognitive skills 
considered essential for substantial engagement with scientific and historical texts. 

In the RA instructional model, students are frequently exposed to texts in both primary 
and secondary source types (e.g., textbooks, web resources, articles, data tables); 
furthermore, students are regularly provided with a variety of comprehension strategies to 
help access the unique content in these materials. With the use of RA reading 
comprehension strategies, students are increasingly prompted and expected to explain their 
understandings or questions through elaborated and detailed descriptions. Over time, as 
students attempt to discuss and explain increasingly complex concepts, they begin to need 
more academic language in order to communicate these ideas effectively. 

Rationale for the ILA Format 

National standards for history and science education emphasize the importance of 
providing students with opportunities to present their understanding as well as use 
knowledge and academic language to communicate explanations and ideas (see National 
Research Council, 1996). Because reading and listening (input) are closely tied to 
writing and speaking (response) in discipline-specific literacy as well as in the RA 
instructional model, we included both reading and writing in the ILA to measure the 
effect of RA instruction on knowledge and literacy acquisition. 



While the written explanation genre was not an explicit component of RA, it is 
expected that students in RA classrooms would have sufficient experience with this genre 
through their utilization of a variety of writing formats. Moreover, through their exposure to 
scientific and historical texts, these students are expected to have developed sufficient 
familiarity with academic language to perform well on this measure. 

CRESST’s model-based assessment uses standard architectures embedded in 
disciplinary content to assess core types of learning — basic knowledge, conceptual 
understanding, problem solving, communication, and teamwork. Two different templates 
have been developed and extensively validated as a way to evaluate content understanding 
and communication. One requires students to generate written explanations given primary 
source materials; the other utilizes computer-based knowledge mapping to display 
comprehension (see Baker, 1994; Chung, O’Neil, & Herl, 1999). Given the current study’s 
focus on the integration of discipline-specific literacy and content knowledge, CRESST’s 
explanation architecture provided us with the ideal format to evaluate this array of skills 
simultaneously. The use of an explanation and argumentation task with reading prompts and 
a writing component provided us with an integrated approach to measuring students’ ability 
to understand and communicate their knowledge. In addition, explanation tasks are a 
dominant genre of school-based writing (Martin & Miller, 1988) and are well suited for 
on-demand assessment conditions. 

The explanation and argumentation architectures may have particular relevance for the 
assessment of scientific literacy — given that science is about the construction of theories 
that explain how the world operates. Scholars have noted that discourse, explanation, and 
argumentation are at the heart of science learning (Boulter & Gilbert, 1995; Duschl & 
Osborne, 2002; Erduran, Simon, & Osborne, 2004; Pontecorvo, 1987). The explanation 
architecture provides a format that examines whether students are able to integrate a 
complex structure of biological as well as other related concepts, the relationships between 
these concepts, the reasons for these relationships, and ways to explain and predict other 
natural phenomena. 

Separate ILA instruments were developed for administration to biology and history 
students. Copies of these instruments are provided in Appendices A and B. In the following 
sections of this report, we describe the design of the measures for the respective subject 
areas. Specifically, we present the particular content focus of each ILA, review test 
specifications, and describe the process of item generation. 



Development of the Biology ILA 

Content Focus Selection 

After reviewing California state content standards for biology and life sciences, we 
created an ontology — a systematic arrangement and categorization of concepts in a field of 
discourse. Developing this type of organizational structure allowed us to uncover the 
relationships between different biology concepts (e.g., what concepts encompass the 
precursor knowledge-set needed to understand a specific standard). This involved unpacking 
and elaborating the standards to create a hierarchy of conceptual information. This hierarchy 
of information was then used to: (a) create a framework for content understanding; (b) shape 
the design of the ILA; and (c) guide the development of the content rubric. Using the 
California Science Teachers Association’s Making Connections: A Guide to Implementing 
Science Standards (Bertrand, DiRanna, & Janulaw, 1999) as a guide, we examined the 
standards in two specific science content areas — genetics and physiology. Based on 
preceding units, content standards and sub-standards, as well as science standards from 
earlier grade levels (e.g., California Science Standards for Grade 7), we determined the 
prior content knowledge that would be necessary to understand the target standards. 

The section of biology content targeted for the ILA was the unit in genetics, which is 
well represented in the California Standards Test (CST) for biology. We chose the topic of 
genetics for the ILA because we expected teachers to spend more instructional time 
covering this content area than other units — given its emphasis on the CST. Overall, the 
review of content standards and the development of a biology ontology provided us with the 
context for determining the specific content targeted in the ILA. 

Test Specification 

The first phase in developing the test specification for each ILA was to provide a 
detailed description of what was to be tested. Based on a review of the standards in the two 
subject areas, it was determined that the ILA should incorporate high-level cognitive skills 
such as analysis, interpretation, evaluation, and synthesis of the information presented in the 
ILA texts — combined with the material learned in class. The tasks in the ILA were aimed at 
eliciting students’ use of higher-level cognitive skills when engaged in reading, analyzing, 
evaluating, and synthesizing the documents through writing. Figures 1 through 3 show the 
target standards for content knowledge, reading, and writing. 



BIOLOGY/LIFE SCIENCES STANDARDS: GENETICS 

Standard 5: The genetic composition of cells can be altered by incorporation of exogenous DNA into the cells. 
Sample basis for understanding this concept: 

5a. Students know the general structures and functions of DNA, RNA, and protein. 

5b. Students know how genetic engineering (biotechnology) is used to produce novel biomedical and agricultural 
products. 



Figure 1. Biology/life sciences content standards targeted in the ILA. 



READING COMPREHENSION STANDARDS 

Standard 2.0: Read and understand grade-level-appropriate material. Analyze the organizational patterns, 
arguments, and positions advanced. 

Structural Features of Informational Materials: 

Standard 2.5: Extend ideas presented in primary or secondary sources through original analysis, evaluation, and 
elaboration. 



Figure 2. Reading comprehension standards targeted in ILA. 



WRITING STANDARDS 

Writing Strategies: 

Standard 1.0: Write coherent and focused essays that convey a well-defined perspective and tightly reasoned 
argument. The writing demonstrates students' awareness of the audience and purpose. 

Standard 1.1: Establish a controlling impression or coherent thesis that conveys a clear and distinctive perspective 
on the subject and maintain a consistent tone and focus throughout the piece of writing. 

Standard 1.2: Use precise language, action verbs, sensory details, appropriate modifiers, and the active rather than 
the passive voice. 

Writing Applications: 

Standard 2.0: Combine the rhetorical strategies of narration, exposition, persuasion, and description to produce texts 
of at least 1,500 words each. Student writing demonstrates a command of standard American English and the 
research, organizational, and drafting strategies. 

Standard 2.3: Write expository compositions, including analytical essays and research reports: 

a. Marshal evidence in support of a thesis and related claims, including information on all relevant perspectives. 

b. Convey information and ideas from primary and secondary sources accurately and coherently. 

c. Anticipate and address readers' potential misunderstandings, biases, and expectations. 

d. Use technical terms and notations accurately. 

Written and Oral English Language Conventions: 

Standard 1.0: Write and speak with a command of standard English conventions. 

Grammar and Mechanics of Writing 

Standard 1.1: Identify and correctly use clauses (e.g., main and subordinate), phrases (e.g., gerund, infinitive, and 
participial), and mechanics of punctuation (e.g., semicolons, colons, ellipses, hyphens). 

Standard 1.2: Understand sentence construction (e.g., parallel structure, subordination, proper placement of 
modifiers) and proper English usage (e.g., consistency of verb tenses). 

Standard 1.3: Demonstrate an understanding of proper English usage and control of grammar, paragraph and 
sentence structure, diction, and syntax. 



Figure 3. Writing standards targeted in the ILA. 



Text Selection 

To inform our text passage selection for the Biology ILA, we examined what linguistic 
resources are used to create scientific meaning and the level of reading comprehension 
proficiency that is required at the high school level. To gain this understanding, we 
conducted a linguistic analysis of two high school biology textbooks: Prentice Hall’s Biology 
(Miller & Levine, 2006) and BSCS’s BSCS Biology: A Molecular Approach (Greenberg, 
2001). The results of the analysis were used as a basis for selecting and modifying the texts 
used in the ILA. 

Overall, we found that high school science textbooks displayed high technicality and 
abstractness. This was evidenced by frequent occurrences of technical vocabulary and 
abstract nouns. In addition, various instances of “grammatical metaphor” (see Halliday, 
1994) were identified in biology textbooks. 1 For example, experiential information (i.e., 
what is happening in the text) was frequently expressed in nominal groups through 
nominalization (e.g., forming the noun “invasion” from the verb “invade”). These nominal 
groups were further expanded through the addition of an embedded clause, an adjective, or a 
prepositional phrase, which resulted in high lexical density. Relationships between 
experiential elements were marked through various connectors, including conjunctions, but 
verbal groups often subsumed the marking of conjunctive relationships (e.g., “to be 
followed by” instead of “and then”). 

This comparative analysis helped us select the final passage to include in the ILA. The 
passage was similar to textbook passages in terms of linguistic difficulty. The final text 
passage was selected from an internet site 2 listed as a supplemental resource for students in 
state-adopted biology textbooks (Miller & Levine, 2006). 

After the text passage selection, we conducted an external review of text passages. 
This step involved consulting with genetic scientists at the University of California, Los 
Angeles (UCLA) for content accuracy, as well as communicating with the authors of the 
text passages to obtain permission for use and to confirm that the content of the passage 
could still be considered current and accurate. The text passages were also reviewed and 
rated by current high school biology teachers for level of difficulty and the content’s 
appropriateness for the study sample. 



1 A grammatical metaphor is a process whereby meaning is constructed in the form of incongruent (i.e., 
metaphorical) grammar. Incongruence is characteristic of written discourse in relatively formal registers. 

2 http://sciencenow.sciencemag.org/cgi/content/full/2006/113/2 



Item Generation 

The generation of items for the ILA (i.e., writing prompts, reading comprehension, 
and metacognitive questions) followed the document selection and was a multi-step process. 
After four of the five text selections, we included a reading comprehension section as a way 
to determine whether the quality of students’ written responses was influenced by their 
reading comprehension levels. The multiple-choice questions were developed with three 
categories of reading comprehension in mind: factual, inferential, thematic and scientific. 
After the generation of various reading comprehension questions for each text in the text set, 
two to three questions were selected per document section. 

The development of the ILA also involved creating two candidate essay prompts based 
on the content of two text sets and on the biology curriculum — with the requirement that 
they elicit higher-order thinking skills. Both prompts required students to synthesize 
information in the reading with prior knowledge. The first prompt was limited to an 
explanation task while the second prompt required students to both explain a biological 
process and develop an argument for one process over another. We used the original essay 
prompt from an existing Biology ILA developed for a National Science Foundation (NSF) 
study and then created a new prompt to serve as the comparison. 

In the late fall of the 2008-2009 school year, the two ILA prototypes were field tested 
with biology teachers to verify the appropriateness of the texts; reading comprehension and 
metacognition questions; and writing prompts. The results from this process indicated that 
students of varying competency levels would most likely be able to respond to the various 
sections of the ILA. We reviewed the student responses and determined that students were 
able to address the more difficult prompt. The second prompt was selected because it met the 
criterion of requiring students to engage in higher-order thinking processes. 

Development of the History ILA 

Content Focus Selection 

Based on a survey of several RA history teachers, we found that World War II was an 
important content area that would be covered in the spring. Within this broad California 
state standard (11.7), we reviewed the eight sub-standards and identified one (11.7.5) as the 
ILA target content standard. We chose this standard for several reasons. First, this standard 
deals with social history and women’s history, which are both commonly addressed on 
document-based questions (DBQs), since textbooks (especially older ones) often focus more 
extensively on political and military history. 3 Targeting social history on DBQs exposes 
students to a wider range of historical issues than those usually included in textbooks. 
Second, the teacher-identified standard textbook, McDougal Littell’s The Americans, 
includes enough coverage of these topics for students to draw on as prior knowledge. Finally, 
the textbook includes multiple primary sources for the 11.7.5 standard, exposing students to 
the types of document genres used in the ILA. 

The California standard 11.7.5 states: 

Discuss the constitutional issues and impact of events on the U.S. home front, including 
the internment of Japanese-Americans (e.g., Fred Korematsu v. United States of 
America) and the restrictions on German and Italian resident aliens; the response of the 
administration to Hitler’s atrocities against Jews and other groups; the roles of women in 
military production; and the roles and growing political demands of African Americans. 

Since the study includes Arizona teachers, we also took the Arizona state standards into 
consideration. Arizona state standard 1SS-P15, PO 2 states that instruction on World War II 
should emphasize: 

Events on the home front to support the war effort (including war bond drives, the 
mobilization of the war industry); women and minorities in the work force (including 
Rosie the Riveter); [and] the internment of Japanese-Americans (including the camps in 
Poston and on the Gila River Indian Reservation, Arizona). 

Both the California and Arizona standards still cover a broad content area; hence, we 
narrowed the ILA target to women and African-Americans on the home front. Recognizing 
that the aim of the ILA is to provide students with opportunities to demonstrate disciplinary 
thinking and reading comprehension skills in an area of instruction to which they have had 
some (but not extensive) exposure, we chose this sub-point for several reasons. First, we 
ruled out addressing Japanese-American internment since this particular topic is 
traditionally heavily covered in both California and Arizona classrooms. The California and 
Arizona standards include specific details concerning internment; furthermore, the standard 
textbook includes a breakout section on the Korematsu case. Therefore, students would 
already have received significant instruction on this content topic, and it would be 
difficult to find text prompts containing information that would be entirely new to students. 

Next, the sub-point of German and Italian resident aliens was eliminated from 
consideration because this topic is only briefly covered in the standard 
textbook and students would likely not have enough prior knowledge to apply to the essay. 
The final decision to focus on African-Americans was made because students would 
not have previously spent a significant amount of class time developing every relevant 
theme related to the topic, yet they would still have enough prior knowledge to potentially 
use in their essays. Additionally, there was a large pool of documents for this topic from 
which we could confidently select text prompts that fit ILA specifications. 

3 Stovel, J. E. (2000). Document analysis as a tool to strengthen student writing. The History Teacher, 33(4), 
501-509. 

Test Specification 

The first phase in developing the test specification for the ILA was to provide a 
detailed description of what skills were to be tested. Based on a review of the standards, it 
was determined that the ILA should incorporate high-level cognitive skills such as analysis, 
interpretation, evaluation, and synthesis of the information presented in the ILA primary 
source documents combined with the material learned in history class. The tasks in the ILA 
were aimed at eliciting students’ use of higher-level cognitive skills when engaging in 
reading, analyzing, evaluating, and synthesizing the documents through writing. Figures 4 
through 7 depict the target standards related to content, analysis, reading, and writing. 



History/Social Science Standards 

Standard 11.7: Students analyze America’s participation in World War II. 

5. Discuss the constitutional issues and impact of events on the U.S. home front, including the internment of 
Japanese Americans (e.g., Fred Korematsu vs. United States of America) and the restrictions on German and Italian 
resident aliens; the response of the administration to Hitler's atrocities against Jews and other groups; the roles of 
women in military production; and the roles and growing political demands of African Americans. 



Figure 4. History/social science standards targeted in the ILA. 



Historical and Social Science Analysis Skills Standards 

Historical Research, Evidence, and Point of View 

2. Students identify bias and prejudice in historical interpretations. 

4. Students construct and test hypotheses; collect, evaluate, and employ information from multiple primary and 
secondary sources; and apply it in oral and written presentations. 

Historical Interpretation 

1. Students show the connections, causal and otherwise, between particular historical events and larger social, 
economic, and political trends and developments. 

3. Students interpret past events and issues within the context in which an event unfolded rather than solely in terms 
of present-day norms and values. 

4. Students understand the meaning, implication, and impact of historical events and recognize that events could 
have taken other directions. 



Figure 5. Historical and social sciences analysis skills standards targeted in the ILA. 



Reading Comprehension Standards 

2.0 Read and understand grade-level-appropriate material. Analyze the organizational patterns, arguments, and 
positions advanced. 

Structural Features of Informational Materials 

2.1 Analyze both the features and the rhetorical devices of different types of public documents (e.g., policy 
statements, speeches, debates, platforms) and the way in which authors use those features and devices. 

Comprehension and Analysis of Grade-Level-Appropriate Text 

2.4 Make warranted and reasonable assertions about the author's arguments by using elements of the text to defend 
and clarify interpretations. 

2.5 Analyze an author's implicit and explicit philosophical assumptions and beliefs about a subject. 



Figure 6. Reading comprehension standards targeted in the ILA. 



Writing Standards 

Writing Strategies: 

1.0 Write coherent and focused essays that convey a well-defined perspective and tightly reasoned argument. The 
writing demonstrates students' awareness of the audience and purpose. 

Organization and Focus 

1.3 Structure ideas and arguments in a sustained, persuasive, and sophisticated way and support them with precise 
and relevant examples. 

Writing Applications: 

2.0 Combine the rhetorical strategies of narration, exposition, persuasion, and description to produce texts of at least 
1,500 words each. Student writing demonstrates a command of standard American English and the research, 
organizational, and drafting strategies outlined in Writing Standard 1.0. 

2.4 Write historical investigation reports: 

a. Use exposition, narration, description, argumentation, or some combination of rhetorical strategies to support 
the main proposition. 

b. Analyze several historical records of a single event, examining critical relationships between elements of the 
research topic. 

c. Explain the perceived reason or reasons for the similarities and differences in historical records with 
information derived from primary and secondary sources to support or enhance the presentation. 

d. Include information from all relevant perspectives and take into consideration the validity and reliability of 
sources. 

Written and Oral English Language Conventions: 

1.0 Write and speak with a command of standard English conventions. 

1.1 Demonstrate control of grammar, diction, and paragraph and sentence structure and an understanding of 
English usage. 

1.2 Produce legible work that shows accurate spelling and correct punctuation and capitalization. 

Figure 7. Writing standards targeted in the ILA. 



Text Selection 

Documents were chosen for the History ILA based on several factors. First, the 
language, images, or data had to be presented in a clear and accessible way that met a grade 
11 high school reading level. Second, the documents needed to be directly related to the 
target history content standard and to the essay prompt. Third, the documents had to point to 
larger themes embedded in both the content standard and essay prompt that students should 
develop in their essays. 

From a review of relevant literature, which largely focused on the Advanced 
Placement (AP) U.S. History Exam and the New York (NY) State U.S. History and 
Government Regents Exam, we determined that including a combination of written and 
visual texts constitutes a standard Document-Based Question (DBQ) writing practice. To 
demonstrate their disciplinary thinking skills, students should be able to read, understand, 
and analyze a wide variety of historical genres — both written and visual. Since the ILA 
target content standard focuses on social history, relevant documents were chosen to 
connect to students’ prior knowledge of the social aspects and effects of WWII on 
African-Americans on the home front. 

In a review of their language aspects, we generally found that high school history 
textbooks displayed high technicality and abstractness. This was evidenced by the frequent 
occurrences of historically specific vocabulary and the use of abstract nouns. In addition, 
various instances of grammatical metaphor (Halliday, 1994) were identified in history 
textbooks.⁴ For example, experiential information (i.e., what is happening in the text) was 
frequently expressed in nominal groups through nominalization (e.g., forming the noun 
“migration” from the verb “migrate”). These nominal groups were further expanded through 
the addition of an embedded clause, an adjective, or a prepositional phrase, which resulted 
in high lexical density. Relationships between experiential elements were indicated through 
various connectors, including conjunctions, but verbal groups often subsumed the use of 
conjunctive relationships (e.g., “to be followed by” instead of “and then”). 

Using criteria developed as part of the RA instructional model, we also considered 
potential text passages more holistically. These selection criteria were 
conveyed to teachers during their training to help them select appropriate texts for classroom 
use. We used the following criteria to select texts: 

• contains illustrations or graphics; 

• has internal coherence; 

• identifies a scientist/team or history authority; 

• explains the inquiry (use of evidence); 

• contains technical vocabulary; 

• is exposition instead of narrative; 

• has data for students to interpret; and 

• has description of methodology. 

4 A grammatical metaphor is when one grammatical structure is substituted for another, such as with 
nominalization (i.e., when a verb is used in the form of a noun). This is characteristic of written discourse in 
relatively formal registers. 

After the initial selection, we conducted an external review of the visual and written 
texts. Current high school history teachers holistically rated and reviewed the documents for 
level of difficulty and applicability to the essay prompt (see Appendix C for the teacher 
feedback survey). 

In its entirety, the final document set — composed of three primary sources (i.e., 
newspaper article, letter, and population data table) and one secondary source (i.e., excerpt 
from a historical journal article) — presented multiple aspects of the content standard topic 
that allowed students to make generalizations, analyze cause and effect, discuss contrasting 
viewpoints, and evaluate the historical impact of the content standard topic. The document 
set included documents and readings that students would most likely not have seen; 
moreover, it might have introduced specific historical information that students had not 
discussed in their classes. Students were expected to apply their disciplinary thinking skills to 
analyze and interpret the new information in the documents, integrate it with 
their related prior knowledge, and construct an evidence-based historical narrative. 

Item Generation 

The generation of items for the ILA (i.e., writing prompts, measures of reading 
comprehension, and metacognitive questions) followed document selection and was a multi- 
step process. We included a reading comprehension section after each text in the ILA as a 
way to determine whether the quality of students’ written responses was influenced by their 
reading comprehension levels. The multiple-choice questions were developed with three 
categories of reading comprehension in mind: factual, inferential, and thematic and 
historical. Many reading comprehension questions were generated for each text. From these 
candidate items, three were selected to be included after each of the texts in the ILA. 

The development of the ILA also involved creating two candidate essay prompts based 
on the content of two text sets as well as the history curriculum, with the requirement that 
they elicit higher-order thinking skills. The essay prompt requires students to synthesize 
information in multiple documents with prior knowledge and elicits more discipline-specific 
skills through documentary analysis of change over time and historical cause and 
effect. We first evaluated DBQ questions from the past nine years of AP U.S. History exams 
and the past six years of NY State Regents exams; this added up to a total of 15 AP 
questions and 16 Regents questions. AP questions routinely direct students to “analyze” 
historical change; “assess the effectiveness” of policies, reforms, etc.; assess the “extent” of 
historical change; or evaluate the “accuracy” of historical interpretations. In contrast, the 
NY Regents exams overwhelmingly ask students to “discuss” historical issues or changes. 
While both the AP and the Regents DBQs are challenging tasks that require higher-order 
thinking, the AP is, as expected, more difficult. Therefore, we used the NY Regents exam as a 
model for developing the essay prompts. 

One potential ILA topic focused on African-Americans on the home front during 
WWII, while the other centered on American women during the same time period. Our 
desire was to create prompts related to the documents that could be adequately answered by 
utilizing content learned that year in U.S. History class, together with information directly 
gathered from the documents. In the fall of the 2007-2008 school year, the two ILA 
prototypes were field tested with several history teachers in the Los Angeles area to verify 
the appropriateness of the texts, reading comprehension and metacognition questions, and 
writing prompts. The results from this process indicated that students of varying 
competency levels would most likely be able to respond to the various sections of the ILA. 
We reviewed the student responses and determined that the African-American ILA would 
elicit the best student responses, since students had more prior knowledge to apply and 
seemed to demonstrate greater understanding of the texts. 

Structure of the ILA 

Overview 

The ILA instruments for biology and history (provided in Appendices A and B, 
respectively) each consisted of three parts. The first was an assessment of students’ 
knowledge of the subject matter. The second part presented students with a series of 
documents (e.g., narrative texts, graphs, illustrations, data tables). The goals of this section 
were to examine students’ reading comprehension, metacognition, and use of reading 
strategies. Reading comprehension was measured by multiple choice questions that could be 
answered using information presented in the texts. Metacognition was assessed by asking 
students to describe their reading process. The use of reading strategies was evaluated by 
reviewing students’ test forms for evidence of note-taking or other annotations. In the third 
part of the ILA, students were asked to write an essay that drew upon information garnered 
from the texts as well as their prior content knowledge and skills. These writing samples 
were rated with respect to both language and content. Additional details concerning the 
structure and scoring of each section of the ILA are described next. 



Part 1: Content Knowledge 

Prior to reading the text passages, students completed a short test consisting of 10 
multiple choice questions intended to measure students’ existing content knowledge. For the 
Biology ILA, these items were selected from the CST Biology test, the SAT II exam, the AP 
Biology exam, and preparation resources for these tests. Part 1 of the History ILA consisted of 10 
multiple choice questions relating to African-American history of the late-nineteenth to 
mid-twentieth centuries (particularly to African-American involvement during WWII) and more 
generally to WWII social history. The items were selected from a pool of publicly released 
CST History items, AP U.S. History items, NY Regents U.S. History items, and related 
test preparation resources. The questions in this first section of the ILA drew upon students’ 
knowledge of the particular subject areas and were administered to aid the interpretation of 
scores on the passage-based multiple choice questions in the subsequent section. 

Part 2: Reading Comprehension, Metacognition, and Reading Strategies 

In the second part of the ILA, students were asked to read a series of passages, answer 
multiple choice questions related to those passages, and reflect on their reading process. In 
addition, students’ test booklets were examined for evidence of their utilization of reading 
strategies. 

Reading Comprehension. The multiple choice questions in Part 2 of the ILA were 
intended to measure students’ reading comprehension. Questions were aligned with the 
passages in such a way that it would have been possible for students to find relevant 
information within the passage and provide a correct response, regardless of their prior 
knowledge of the subject matter. However, because the questions still draw on 
content knowledge, students could perhaps provide correct answers 
by relying primarily on their prior knowledge, and not only on the particular knowledge gained 
by reading and comprehending the text at hand. In other words, a student might compensate 
for low reading comprehension with high prior knowledge or compensate for low prior 
knowledge with high comprehension. As such, scores on these items are best interpreted 
alongside students’ content knowledge scores from Part 1 of the ILA. This point will be 
further addressed in our analysis and discussion of the data. 

Metacognition Scoring Rubric. After completing the multiple choice questions, 
students were encouraged to reflect on their thought process and describe how they 
approached the reading passages. This metacognition item was designed by WestEd with 
input from CRESST. The question was designed to measure the degree to which students 
were aware of the thought processes they had utilized in reading the documents. Students 
were asked to respond to the following question: 

Parts of this document were complex. What did you do as you were reading to improve 
your understanding? Please be as detailed as possible. 

The metacognition scoring rubric (see Appendix D) was adapted from previous RA work. 
Students’ responses were rated on a 4-point scale. The profile of a score point could be 
broken down into three main criteria: the degrees to which the student (a) engages with 
complexities in the text or with the ideas that require attention; (b) describes thinking 
processes that occur while reading; and (c) explains an approach to how he or she thinks 
about the reading. Additionally, raters considered how aware students were of their 
thinking, their degree of self-monitoring, and lastly, their executive control. 

Reading Strategies Scoring Rubric. In developing the reading strategies rubric (see 
Appendix E), we modified the NSF Reading Process rubric that was based on the Strategic 
Literacy Initiative’s CERA assessment. In particular, we extended their work to produce a 
rubric geared towards use in large-scale scoring sessions. The key points of the rubric 
address students’ reading engagement, based on the reading strategy dimensions identified 
by the RA approach to content area reading. The rubric was applied to annotations made on 
the texts presented in Part 2 of the ILA. 

The Reading Strategies rubric was based on a 4-point scale. The profile of a score 
point could be broken down into three main criteria: consideration of the frequency of 
annotations, the variety in the annotations, and the types of reading strategies used (i.e., 
general versus discipline-specific). The strategies assessed were drawn from the RA theory 
of content area reading. Table 1 provides additional information about the evidence that 
raters looked for while rating as well as the types of reading strategies utilized by students. 



Table 1 

Descriptions of Annotations and Reading Strategies 



Text annotations 

• Markings 
• Underlines 
• Highlights 
• Circlings/boxings 
• Connecting lines and arrows 
• Symbols 
• Comments 
• Questions 
• Statements 

General reading strategies 

• Identifying key vocabulary 
• Identifying unknown vocabulary 
• Attempting to define unknown vocabulary (e.g., through identifying root words, 
looking ahead in the text for a definition) 
• Identifying the main ideas of the text 
• Paraphrasing 
• Summarizing 
• Predicting the content of text sections 
• Identifying confusions 
• Using context clues to build understanding 

Biology reading strategies 

• Connecting to/applying prior biology knowledge 
• Questioning scientific methods 
• Attending to and evaluating evidence 
• Analyzing graphs, diagrams and other visual aids, including 
organizing/representing data 
• Considering the implications of science beyond the text’s scope 

History reading strategies 

• Making connections to prior history knowledge 
• Linking ideas together within a document and/or across documents 
(intertextual reading) 
• Evaluating the source of a document 
• Determining bias or point of view 
• Considering the document in historical context 
• Identifying cause and effect 

Note. Evidence for text annotations found only on text passage. 

A student who received a score of 4, for example, would have displayed a strong use 
of reading strategies demonstrated through annotations throughout the set of texts; 
employed a variety of annotations; and shown evidence of using at least one discipline- 
specific reading strategy. In contrast, a student receiving a score of 1 would have shown 
little or no evidence of the use of reading strategies. In this case, the annotations may have 
been minimal, disconnected, or indiscriminate (e.g., large sections of the passage 
highlighted or underlined lacking an apparent purpose). 

Part 3: Writing Assessment 

Parts 1 and 2 of the ILA were administered together. During the following ILA 
administration, half of the students moved on to Part 3, whereas the other half completed a 
different assessment called the Degrees of Reading Power (DRP). Part 3 of the ILA is a 
writing task that directs students to write an essay integrating information from the 
documents with knowledge they have learned in their biology or history class. For the 
biology test, in order to help students approach the task as one of scientific explanation and 
argumentation, students were instructed to imagine that they were biologists advising a 
farmer about preventing crop destruction. Students were specifically directed to include an 
explanation of the recombinant DNA process, a description of the safety concerns this 
process presents, and an argument supporting either traditional cross-breeding or genetic 
engineering. For the history test, in order to help them approach the task as one of historical 
explanation, students were instructed to imagine that they were journalists writing about 
African-Americans’ experiences on the home front during WWII. Students were specifically 
directed to include discussions of labor discrimination, migration, and racial violence; 
develop larger themes; and provide analysis in their essays. 

Writing Rubrics. The scoring rubrics for Part 3 of the ILA (see Appendices F and G) 
address issues of language and academic writing within the science or history genre. We 
adapted previously developed and validated rubrics (from NSF Biology ILA scoring), which 
evaluated student content and language knowledge along two separate dimensions. Our 
language rubric followed a linguistic analysis of academic language and writing practices 
and also reflected grade 11 English language arts standards. Both the writing content and 
writing language rubrics utilize a 4-point rating scale. Through the characterization of their 
respective score points, both rubrics describe various aspects of writing proficiency. Each 
score point within a given rubric provides a portrait of students’ explanations as they may 
appear at a given proficiency level. 

Rationale for Two Writing Rubric Dimensions. Our evaluation of commonly used 
performance assessments revealed that language expectations are often implicitly embedded 
within the assessment criteria. Based on a review of performance assessments used in high 
school biology and history settings, we found a recurring discrepancy between assessment 
scoring criteria and performance expectations. For example, in the AP exam scoring 
guidelines, points are awarded to student writing based on the inclusion of certain content 
information. However, the AP scoring guidelines also specify that high scoring essays will 
be “well organized” and “well written,” without further discussion of the specific features 
that constitute these writing characteristics. The final score is the accumulation of these 
points. The scoring rubric for the NY Regents U.S. History exam combines aspects of essay 
organization (e.g., inclusion of an introduction and conclusion) with content-focused criteria 
of document analysis and the incorporation of relevant outside knowledge. Similar problems 
were found in the scoring of routine, in-class writing tasks. For example, in the 2006 
Prentice Hall biology textbook, students are asked to complete writing assessments called 
“Writing in Science” as part of the end-of-chapter assessments. This task entails writing a 
paragraph or group of paragraphs on target biology content. Like the AP Biology writing 
exam, Prentice Hall’s writing assessment criteria explicitly refer only to the scoring of 
content. For example, in one prompt, students are asked to write a paragraph that includes 
(a) an explanation of a polymer; (b) a description of organic compounds; and (c) how these 
organic compounds are used in the human body. Notably, the evaluation criteria relate only 
to biology content (e.g., one of the criteria requires that students “explain that a polymer is a 
macromolecule made up of monomers joined together”). None of the evaluation criteria 
pertain specifically to the language features needed to successfully provide an explanation 
of the content. As with the AP Biology exam, students are expected to communicate science 
concepts using academic language, though these literacy skills are only implicitly evaluated 
as part of the assessment score. 

Since the scoring guidelines for tests and writing tasks often conflate content and 
language, it is unclear whether raters’ scores measure content understanding or a 
combination of content understanding and students’ literacy skills for describing, analyzing, 
and explaining. Without explicit (and separate) scoring criteria to evaluate language and 
literacy skills, it is difficult to determine the extent to which writing quality should reflect 
literacy/writing skills versus content knowledge. In order to measure student performance 
on the written explanation task, we developed two separate rubrics to evaluate biology 
content knowledge and academic language proficiency in the student written explanations, 
with both constructs expected to be impacted by RA instruction and students’ use of RA 
strategies. 

Writing Content Scoring Rubric. For the content rubric (see Appendix F), criteria 
were formed, in part, by using the previously developed and validated CRESST rubrics; AP 
scoring guidelines; and NY Regents test rubrics as guides. Our goal was to measure 
students’ conceptual knowledge; ability to connect principles and concepts; and capability 
to extend prior knowledge of concepts (beyond the limited contexts in which they were 
acquired), in order to create well-developed explanations. Based on this goal, we developed 
a list of four initial key points upon which to base our rubric: (a) understanding of the target 
discipline-specific content; (b) clarity of explanation; (c) use of supportive evidence from 
the provided texts; and (d) inclusion of prior knowledge. 

Both writing rubrics (language and content) were rated on a 4-point scale, with each 
score reflecting different aspects of writing proficiency. The rubrics provide a portrait of a 
student’s biology explanation as it may appear at a given proficiency level. 



A student response receiving a high writing content score had to satisfy most or all of 
the scoring criteria, which were elaborated in the rubric’s 4-point description. Specifically, 
the response demonstrated well-developed understanding of the target content. In addition, 
this content was clear, focused, thoroughly explained, and elaborated with strong supportive 
evidence. The content dimension also encompassed whether or not a student demonstrated 
relevant knowledge that extended beyond information explicitly given in the text passage 
(i.e., whether or not a student incorporated prior knowledge). Lastly, this dimension focused 
on the extent to which students incorporated relevant information from the texts into their 
responses. The specific content raters were to look for in student responses was elaborated 
in the supplemental documents for the writing content rubric. 

Together, these aspects of the rubric were expected to measure content 
understanding and students’ ability to successfully meet a fairly demanding cognitive 
challenge. Specifically, in addition to possessing the necessary content knowledge, students 
needed to apply complex cognitive skills, such as textual analysis and synthesis of 
historical information from multiple sources, in order to score well on this task. 

Writing Language Scoring Rubric. In developing the ILA Writing Language Rubric 
(see Appendix G), we modified the language dimensions that were previously developed 
and validated in earlier CRESST work (see Aguirre-Munoz et al., 2005) in order to align 
them with the RA instructional model; a discipline-specific setting; and the explanation 
genre. Key points were used to evaluate students’ academic language proficiency on the 
ILA, based on the dimensions identified as significant in academic writing. The language 
rubric specifically focuses on assessing students’ linguistic command of grammatical 
structures that are directly related to the explanation genre and that are also aligned with the 
California Content Standards in writing. Additionally, the measured language features 
include those that students frequently become aware of during their analyses of text schemas 
and text structures in the RA instructional model. For students in RA classrooms, the 
language rubric also implicitly measures how well students are able to transfer the academic 
language they have become familiar with in the Reading Strategies into their writing 
process. Specifically, the language rubric measured three concepts that define the overall 
qualities of a historical or scientific explanation. These include: (1) appropriate text 
cohesion, (2) varied and precise word choice, and (3) a formal, impersonal tone. 

As we looked for text cohesion, we checked for sentence structure variety and the use 
of expressions of causality through nominalization (i.e., noun phrases used in 
place of verb forms), causative verbs (e.g., led to, resulted from), and/or transitional 
expressions. In looking for precise and varied word choice, we checked for discipline-specific 
vocabulary, as well as everyday terms used with subject-specific meanings. In both 
cases, we looked for these words to be organized as part of expanded noun phrases (e.g., 
because of racial discrimination, many blacks decided to pack up and get out of the rural 
south). For evidence of an impersonal and authoritative tone, we looked for use of third 
person, passive voice, and for the presence of few or no speech markers (e.g., “well”, “you 
know”, “like”). While some debate exists in the field as to whether an authoritative tone is 
necessary for good written communication, it remains the standard for academic writing; 
thus, it is a key aspect of how we have defined and measured appropriate academic language 
use in our language rubric. 

Based on previous CRESST work (see Aguirre-Munoz et al., 2005), we knew that 
most students in the early years of high school do not have the academic language 
proficiency to produce high-quality academic explanations. For this reason, the language 
rubric was structured to sensitively measure a range of academic language proficiency levels 
in science and history writing. We related the ideas of abstraction, informational density, 
and technicality to three systemic functional linguistic concepts. Mode (the manner in which 
ideas are communicated) refers to students’ ability to create appropriate text cohesion in 
their writing. Field (the linguistic elements used to communicate those ideas) signifies 
students’ ability to use varied and precise word choice. Tenor (the tone of that 
communication) refers to students’ ability to establish a formal, impersonal tone in their 
writing. 

In order to receive a high score on the language dimension, a student’s explanation 
had to meet most or all of the following criteria: demonstration of very good text cohesion 
through regular use of sentence structure variety (specifically, through use of marked 
themes); consistent use of precise and varied word choice (specifically, through use of 
expanded noun phrases); and use of an impersonal and authoritative tone with few or no 
speech markers. The length of a student’s paper was taken into consideration to the extent 
that the writing needed to be long enough to provide evidence of academic language 
proficiency. Further discussion of the writing language scoring rubric is provided in 
Appendix H. 

Methods 

Sample 

Sixty-one biology teachers (i.e., 20 men and 41 women), representing 54 public high 
schools in California, agreed to participate in the study. Their length of teaching experience 
at the onset of the RA training ranged from 1 to 36 years, with an average of 11 years. 
Teachers in the treatment group participated in the initial RA professional development 
during the summer of 2007 and then attended follow-up sessions during the school year. 
The Biology ILA was administered at the end of the 2008-2009 school year. A total of 825 
ILA Part 1, 798 ILA Part 2, and 383 ILA Part 3 student assessments were collected from 47 
biology teachers. 

Sixty-three history teachers from 56 California public high schools participated in the 
study. The sample included roughly an equal number of male (31) and female (32) teachers. 
Their length of teaching experience at the onset of the training year ranged from 2 to 37 
years — with an average of 12 years. Two cohorts of history teachers were trained in the RA 
program. The first cohort participated in the initial professional development during the 
summer of 2006 and administered the History ILA at the end of the 2007-2008 school year. 
Teachers in the second cohort began their training during the summer of 2007 and 
administered the History ILA at the end of the 2008-2009 school year. A total of 869 ILA 
Part 1, 850 ILA Part 2, and 391 ILA Part 3 student assessments were collected from 49 
history teachers. 

The Scoring Session 

CRESST researchers trained teams of raters to score Parts 2 and 3 of the ILA during 
the summers following their administration. The training and scoring sessions were held 
over several days. To minimize rater bias, all identifying information (student names, 
teacher names, school names) was removed from the student papers. In addition, the test 
booklets did not include any markings related to treatment group assignment. Responses 
were randomly distributed into packets containing approximately 20 responses each. 

All raters underwent intensive training to learn and practice implementing the scoring 
procedures. These sessions also provided opportunities to address raters’ questions and 
ensure that the scoring rubrics were clear. Raters received two days of training on the 
content and language rubrics and a half day of training for the reading strategies and 
metacognition rubrics. The training was followed by a scoring session. Within each scoring 
session, students’ responses were read and scored by two different raters. The final scores 
were obtained by taking the arithmetic mean of the scores assigned by two raters, thereby 
reducing the influence of rater variability. 
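
The packet assembly and score averaging described above can be illustrated with a minimal Python sketch. The function names, the fixed seed, and the example identifiers are hypothetical and are used only for illustration; the report does not describe the tools actually used to assemble packets.

```python
import random


def make_packets(response_ids, packet_size=20, seed=None):
    """Shuffle de-identified response IDs and split them into packets.

    The last packet may hold fewer than `packet_size` responses.
    """
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(response_ids)
    rng.shuffle(shuffled)
    return [shuffled[i:i + packet_size]
            for i in range(0, len(shuffled), packet_size)]


def final_score(score_rater_1, score_rater_2):
    """Final score: arithmetic mean of the two raters' scores."""
    return (score_rater_1 + score_rater_2) / 2


# Hypothetical example: 50 anonymized responses, packets of about 20.
packets = make_packets(range(1, 51), packet_size=20, seed=0)
print([len(p) for p in packets])
print(final_score(3, 4))
```

Averaging two ratings on a 4-point scale yields final scores in half-point steps (e.g., ratings of 3 and 4 give 3.5).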

Reliability of ILA Scores 

A series of generalizability studies was conducted in order to examine the reliability 
of the ILA components. Generalizability theory (see, e.g., Cronbach et al., 1972; Shavelson 
& Webb, 1991) explicitly acknowledges that some universe of acceptable observations 
exists that is larger than the set of test conditions within a given study. Moreover, we would 
view any sample of observations drawn from that universe as being equally acceptable. In 
the case of the ILA, this means that we would not want scores to depend greatly on the 
particular test items that students were given or the particular raters who assigned scores. 
Generalizability theory, then, provides a framework for understanding the extent to which 
variability in observed scores can be attributed to various aspects of the measurement 
design. Importantly, it allows simultaneous treatment of these design features (though, in the 
case of the student ILA scores, only single facet designs were used). This is in contrast to 
more classical approaches, in which only a single source of measurement error is considered 
at a time, leading to the calculation of multiple reliability coefficients (inter-rater reliability, 
internal consistency, test-retest reliability, etc.), which can make it difficult to assess the 
overall dependability of a measure. Here, we describe findings from generalizability studies 
for the metacognition and reading strategies items in Part 2 as well as the writing language 
and content scores from Part 3. In addition, we present an examination of students’ scores 
on the multiple choice tests in Parts 1 and 2 of the ILA. For each measure, we present 
estimates of the reliability coefficients based on the measurement designs used in this study. 
However, it should be noted that generalizability studies provide valuable information that 
could inform the design of future assessments, including use of the ILA in future studies. 

Two coefficients are calculated for each score. The first, $\hat{\rho}^2$, describes the reliability 
of the score for relative decisions and is roughly equivalent to the squared correlation 
between the observed scores and those that might be obtained by averaging over many 
repeated observations (the universe score). It is calculated as the proportion of expected 
variance in observed scores ($\sigma_s^2 + \sigma_{Rel}^2$) that is due to variance in universe scores ($\sigma_s^2$). 
This coefficient can be considered the extent to which the measure provides a consistent 
rank ordering of students. The second coefficient, $\hat{\Phi}$ (also known as the index of 
dependability; Brennan & Kane, 1977), describes the proportion of total variance in 
observed scores ($\sigma_s^2 + \sigma_{Abs}^2$) that is attributable to variability in the universe score. It reflects 
the reliability of the scores for absolute decisions (when the magnitude of the score is of 
interest and not only the rank ordering of students). Formulas for both $\hat{\rho}^2$ and $\hat{\Phi}$ are shown 
below. 

$$\hat{\rho}^2 = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_{Rel}^2} \qquad\qquad \hat{\Phi} = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_{Abs}^2}$$

As evident in the formulas, the two indices differ only in their denominator. In both 
cases, the denominator is expressed as a sum of “true” variance ($\sigma_s^2$) and error variance 
(either $\sigma_{Rel}^2$ or $\sigma_{Abs}^2$). The difference between the two is simply in how the error variance is 
calculated. In the case of $\sigma_{Rel}^2$, only variance components that represent interactions with 
students (and thus affect rank ordering) are considered. For $\sigma_{Abs}^2$, both interactions and main 
effects are considered. Thus, $\sigma_{Abs}^2$ is always equal to or larger than $\sigma_{Rel}^2$. As a consequence, 
$\hat{\rho}^2$ is always equal to or greater than $\hat{\Phi}$. 
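
As a concrete illustration of the two formulas, the following minimal Python sketch computes both coefficients from a universe-score (student) variance and the two error variances. The function name and the numerical values are hypothetical and are not taken from the ILA data.

```python
def g_coefficients(var_student, var_rel, var_abs):
    """Return (rho2, phi) from universe-score and error variances.

    rho2: generalizability coefficient for relative decisions.
    phi:  index of dependability for absolute decisions.
    """
    if var_student <= 0 or var_rel < 0:
        raise ValueError("var_student must be positive and var_rel non-negative")
    if var_abs < var_rel:
        raise ValueError("absolute error variance cannot be smaller than relative")
    rho2 = var_student / (var_student + var_rel)
    phi = var_student / (var_student + var_abs)
    return rho2, phi


# Hypothetical variances, for illustration only.
rho2, phi = g_coefficients(var_student=0.50, var_rel=0.25, var_abs=0.50)
print(round(rho2, 3), round(phi, 3))
```

Because the absolute error variance includes every term in the relative error variance plus the main effect of the facet, the first coefficient returned can never be smaller than the second.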

For the ILA, two measurement designs were utilized. For the multiple choice tests of
content knowledge and reading comprehension, scores reflect an averaging across the items of
each test. This corresponds to a students-by-items ($s \times i$) design. Here, variance
components for students ($\hat{\sigma}^2_s$) and items ($\hat{\sigma}^2_i$) are estimated,
along with a residual term ($\hat{\sigma}^2_{si,e}$). The subscript of the residual reflects
the fact that this term is actually a sum of the variance due to the interaction of students
and items ($si$) and additional unexplained random variance ($e$). Since this is a design with
only one facet, the variance $\hat{\sigma}^2_{Rel}$ is equal to $\hat{\sigma}^2_{si,e}$ divided
by the number of items; $\hat{\sigma}^2_{Abs}$ is the sum of $\hat{\sigma}^2_i$ and
$\hat{\sigma}^2_{si,e}$, divided by the number of items. The scores for reading strategies,
metacognition, writing content, and writing language were based on averages of the scores
assigned by multiple raters, a students-by-raters ($s \times r$) design. The variance
components estimated for these scores include those for students ($\hat{\sigma}^2_s$), raters
($\hat{\sigma}^2_r$), and the residual term ($\hat{\sigma}^2_{sr,e}$). Here, the variance
$\hat{\sigma}^2_{Rel}$ is equal to $\hat{\sigma}^2_{sr,e}$ divided by the number of raters;
$\hat{\sigma}^2_{Abs}$ is the sum of $\hat{\sigma}^2_r$ and $\hat{\sigma}^2_{sr,e}$, divided by
the number of raters.
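
The calculation described above can be sketched in a few lines of Python. This is only an
illustration under the one-facet crossed design described in the text; the function name and
argument names are hypothetical, and the input values are the biology reading strategies
components reported in Table 2 with the three raters listed in Table 4.

```python
def g_coefficients(var_students, var_facet, var_residual, n_levels):
    """Return (rho2, phi) for a one-facet crossed design (s x i or s x r).

    var_facet is the item or rater main-effect component; n_levels is the
    number of items or raters averaged over in actual scoring.
    """
    # Relative error: only the interaction/residual term, averaged over levels.
    rel_error = var_residual / n_levels
    # Absolute error: facet main effect plus residual, averaged over levels.
    abs_error = (var_facet + var_residual) / n_levels
    rho2 = var_students / (var_students + rel_error)
    phi = var_students / (var_students + abs_error)
    return rho2, phi


# Biology reading strategies: components from Table 2, 3 raters per Table 4.
rho2, phi = g_coefficients(1.114, 0.032, 0.128, n_levels=3)
print(f"rho2 = {rho2:.2f}, phi = {phi:.2f}")  # rho2 = 0.96, phi = 0.95
```

Because the absolute error term adds the facet main effect to the relative error term, the
second value returned can never exceed the first.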

For the content knowledge and reading comprehension scores, generalizability coefficients
were estimated from samples of student ILA responses randomly drawn from the full scoring
samples. For scores obtained from Parts 2 and 3 of the ILA, coefficients were estimated from
either random samples from the scoring sample or from independent (calibration) samples scored
by multiple raters. Estimates of the variance components for scores on the biology and history
assessments are summarized in Tables 2 and 3, respectively. The final column of these
tables presents the proportion of variance attributed to each component. Larger values for
the component attributed to students are desired, as they result in larger reliability
coefficients. On the other hand, these proportions should not be directly compared across
scores, since the measurement designs differ. Specifically, scores on the content knowledge
and reading comprehension tests are obtained by averaging over the test items, while other
ILA scores result from averaging over raters. Nevertheless, it is somewhat concerning that
the percentages related to $\hat{\sigma}^2_s$ are rather small (relative to those for
$\hat{\sigma}^2_i$ and $\hat{\sigma}^2_{si,e}$) for the multiple choice tests of content
knowledge and reading comprehension, compared to the other ILA scores. The estimates for the
main effect of items ($\hat{\sigma}^2_i$) reflect variation in the difficulty of items, while
the large estimates for the residual term suggest substantial person-by-item interaction
(i.e., different items give different rank ordering of students), a large amount of
unexplained variance in scores, or both. We will return to these tests in the subsequent
section. The estimates appear more reasonable for the other measures. The small percentages
related to the rater facet ($\hat{\sigma}^2_r$) indicate that the raters were quite consistent
in the severity of their ratings. The estimates for the $\hat{\sigma}^2_{sr,e}$ term (and the
corresponding percentages) suggest that student-by-rater interactions and unexplained
random error contributed more to the observed variability in scores than the main effect of
raters.

Table 2

Variance Component Estimates for Biology ILA Scores

Measure                     Source of variation   Component                 Estimate   % total
Content knowledge           Students (s)          $\hat{\sigma}^2_s$            .013       5.3
(100 students, 10 items)    Items (i)             $\hat{\sigma}^2_i$            .070      27.7
                            Residual (si,e)       $\hat{\sigma}^2_{si,e}$       .169      67.0
Reading comprehension       Students (s)          $\hat{\sigma}^2_s$            .035      14.1
(100 students, 10 items)    Items (i)             $\hat{\sigma}^2_i$            .022       8.6
                            Residual (si,e)       $\hat{\sigma}^2_{si,e}$       .194      77.3
Metacognition               Students (s)          $\hat{\sigma}^2_s$            .202      43.6
(20 students, 8 raters)     Raters (r)            $\hat{\sigma}^2_r$            .032       6.8
                            Residual (sr,e)       $\hat{\sigma}^2_{sr,e}$       .230      49.6
Reading strategies          Students (s)          $\hat{\sigma}^2_s$           1.114      87.5
(20 students, 8 raters)     Raters (r)            $\hat{\sigma}^2_r$            .032       2.5
                            Residual (sr,e)       $\hat{\sigma}^2_{sr,e}$       .128      10.0
Writing - content           Students (s)          $\hat{\sigma}^2_s$            .963      76.8
(20 students, 9 raters)     Raters (r)            $\hat{\sigma}^2_r$            .024       1.9
                            Residual (sr,e)       $\hat{\sigma}^2_{sr,e}$       .266      21.2
Writing - language          Students (s)          $\hat{\sigma}^2_s$            .835      72.4
(20 students, 9 raters)     Raters (r)            $\hat{\sigma}^2_r$            .073       6.4
                            Residual (sr,e)       $\hat{\sigma}^2_{sr,e}$       .245      21.2

Note. Content knowledge and reading comprehension estimates based on groups of 100 students
randomly selected from the full scoring sample. Analyses of scores for Parts 2 and 3 based on
reliability samples of 20 students and 8 or 9 raters (depending on measure). Variance component
estimates obtained using random effects ANOVA.
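
As a minimal sketch of how variance components of this kind can be obtained from a random
effects ANOVA, the Python function below uses the expected mean squares for a fully crossed
design with one score per student and facet level (item or rater). The report does not state
which software or exact estimator was used, so the function name, the input layout, and the
toy scores are assumptions made for illustration only. Truncating negative estimates at zero
follows the convention stated in the note to Table 3.

```python
def variance_components(scores):
    """Estimate (var_students, var_facet, var_residual) for a crossed design.

    scores[s][f] holds the single score for student s at facet level f
    (an item or a rater). Illustrative sketch, not the report's code.
    """
    n_s = len(scores)
    n_f = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_s * n_f)
    row_means = [sum(row) / n_f for row in scores]
    col_means = [sum(row[f] for row in scores) / n_s for f in range(n_f)]

    # Sums of squares for students, the facet, and the residual.
    ss_s = n_f * sum((m - grand) ** 2 for m in row_means)
    ss_f = n_s * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_s - ss_f

    ms_s = ss_s / (n_s - 1)
    ms_f = ss_f / (n_f - 1)
    ms_res = ss_res / ((n_s - 1) * (n_f - 1))

    # Solve the expected mean square equations; truncate negatives at zero.
    var_res = ms_res
    var_s = max((ms_s - ms_res) / n_f, 0.0)
    var_f = max((ms_f - ms_res) / n_s, 0.0)
    return var_s, var_f, var_res


# Hypothetical ratings: 3 students (rows) scored by 2 raters (columns).
toy_scores = [[1, 2], [3, 4], [5, 6]]
print(variance_components(toy_scores))
```

In the toy data the second rater is uniformly one point more severe, so all of the variation
is attributed to students and raters and the residual component is essentially zero.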

Table 3

Variance Component Estimates for History ILA Scores

Measure                     Source of variation   Component                 Estimate   % total
Content knowledge           Students (s)          $\hat{\sigma}^2_s$            .016       6.6
(100 students, 10 items)    Items (i)             $\hat{\sigma}^2_i$            .038      15.3
                            Residual (si,e)       $\hat{\sigma}^2_{si,e}$       .194      78.1
Reading comprehension       Students (s)          $\hat{\sigma}^2_s$            .025      10.2
(100 students, 12 items)    Items (i)             $\hat{\sigma}^2_i$            .040      16.0
                            Residual (si,e)       $\hat{\sigma}^2_{si,e}$       .184      73.8
Metacognition               Students (s)          $\hat{\sigma}^2_s$            .412      79.7
(100 students, 2 raters)    Raters (r)            $\hat{\sigma}^2_r$            .003       0.6
                            Residual (sr,e)       $\hat{\sigma}^2_{sr,e}$       .102      19.7
Reading strategies          Students (s)          $\hat{\sigma}^2_s$           1.564      92.7
(5 students, 7 raters)      Raters (r)            $\hat{\sigma}^2_r$            .031       1.8
                            Residual (sr,e)       $\hat{\sigma}^2_{sr,e}$       .093       5.5
Writing - content           Students (s)          $\hat{\sigma}^2_s$            .355      49.5
(20 students, 5 raters)     Raters (r)            $\hat{\sigma}^2_r$            .000        .0
                            Residual (sr,e)       $\hat{\sigma}^2_{sr,e}$       .362      50.5
Writing - language          Students (s)          $\hat{\sigma}^2_s$            .465      57.8
(20 students, 5 raters)     Raters (r)            $\hat{\sigma}^2_r$            .057       7.1
                            Residual (sr,e)       $\hat{\sigma}^2_{sr,e}$       .283      35.1

Note. Content knowledge, reading comprehension, and metacognition estimates based on groups of
100 students randomly selected from the full scoring sample. Analyses of reading strategies and
writing scores based on reliability studies of samples with varying numbers of students and
raters (depending on measure). Variance component estimates obtained using random effects
analysis of variance. Negative estimates set to zero.



As previously described, coefficients $\hat{\rho}^2$ and $\hat{\phi}$ were calculated from the
variance component estimates and the number of observations for each of the design facets
(i.e., the numbers of items and raters used in actual scoring); these results are shown in
Table 4. As expected from the results in Tables 2 and 3, reliability estimates are somewhat
low for the multiple choice tests (content knowledge and reading comprehension) but in a more
acceptable range for the other measures.







Table 4

Coefficients for Relative and Absolute Decisions for Biology ILA Scores

Measure                     n levels             Relative decisions    Absolute decisions
                            (items or raters)
Biology ILA
  Content knowledge         10                   .44                   .36
  Reading comprehension     10                   .65                   .62
  Metacognition             2                    .64                   .61
  Reading strategies        3                    .96                   .95
  Writing-content           2                    .88                   .87
  Writing-language          2                    .87                   .84
History ILA
  Content knowledge         10                   .46                   .41
  Reading comprehension     12                   .62                   .58
  Metacognition             2                    .90                   .89
  Reading strategies        3                    .97                   .96
  Writing-content           2                    .66                   .66
  Writing-language          2                    .77                   .73

Note. Based on estimated variance components (Tables 2 and 3) and number of facet levels (items
or raters) in the measurement design.



It should be noted that the generalizability coefficient $\hat{\rho}^2$ for the multiple choice tests
is equivalent to Cronbach's alpha (internal consistency), which may be viewed as a measure 
of the average correlation among items on a test. However, this index is most interpretable 
for uni-dimensional tests. The presence of multiple dimensions (i.e., multiple constructs 
influencing test responses) could result in biased estimates of reliability, though the 
direction of such bias would depend on the nature of the relationships between dimensions. 
Thus, we consider possible violations of uni-dimensionality in the tests of knowledge and 
reading comprehension. 

Tables 5 and 6 present descriptive statistics for each item in the tests of content
knowledge, including the percent of respondents with correct answers and the correlation
between item and total score on the remaining items in the test. Here, it is evident that these
tests include items that reduce the internal consistency of the scale (resulting in a smaller
generalizability coefficient). Specifically, items 1, 2, 5, and 6 from the Biology ILA and
item 10 from the History ILA have rather weak correlations with other items on the test.
Analyses of item responses suggest that the poor performance of these items may be due to
students having difficulty choosing between available response choices. It is notable that the
percentage of students answering correctly was low for each of these five items. This may
create a floor effect of sorts, where even high achieving students (as demonstrated in their
responses to other questions) seemed to do no better on these items than what might be
expected if they were simply guessing. An alternative explanation could be that these items
measure abilities that are qualitatively different from the remainder of the test (i.e., the
test is multidimensional). Whatever the cause, the internal consistency of each test can
actually be increased if its problematic items are removed. The last columns of Tables 5 and 6
show that the item-test correlations are generally larger once the problematic items are
removed.
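
The two item statistics discussed here, the corrected item-total correlation and internal
consistency (Cronbach's alpha), can be sketched as follows. The response matrix is hypothetical
and far smaller than the study samples; the function names are illustrative and are not taken
from the report.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(responses, item):
    """Correlate one item with the total score on the remaining items."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)


def cronbach_alpha(responses):
    """Cronbach's alpha from a students-by-items matrix of item scores."""
    k = len(responses[0])
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical 0/1 responses: 6 students (rows) by 4 items (columns).
# The last item runs against the other three.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
reduced = [row[:3] for row in responses]  # drop the last item

print(corrected_item_total(responses, 3))  # negative for the odd item
print(cronbach_alpha(responses), cronbach_alpha(reduced))
```

In this toy example the last item correlates negatively with the rest of the test, and alpha
is higher for the three-item version than for the full four-item version, which is the same
pattern the reduced tests are meant to exploit.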

Similar analyses were conducted for the reading comprehension tests. Tables 7 and 8
present descriptive statistics for these tests. Item 6 from the Biology ILA and item 3 from
the History ILA both appear to be problematic. The correlations between the scores on these
items and the total scores on the remaining items are rather close to zero. Reanalyzing the
tests without these items produces very little change in the item-test correlations.

Response data for the content knowledge and reading comprehension tests were also 
analyzed within an item response theory (IRT) framework. Appendix G presents a summary 
of the results for the full- and reduced-length tests for both the Biology and History ILA 
instruments. A three-parameter logistic (3PL) model with two correlated factors 
(corresponding to the two tests, content knowledge and reading comprehension) was used. 
The 3PL model estimates discrimination, intercept, and guessing parameters for each item. 







Table 5

Descriptive Test Statistics for the Biology Test of Content Knowledge (Biology ILA Part 1)

        % of students           Corrected item-total correlation
Item    answering correctly     Full test     Reduced test^a
1       31.4                     .02          NA
2        6.4                    -.07          NA
3       42.8                     .24          .23
4       73.7                     .21          .28
5       13.6                     .02          NA
6       18.9                    -.04          NA
7       82.4                     .18          .26
8       54.5                     .23          .29
9       36.1                     .21          .26
10      61.1                     .18          .22

Note. ^a Reduced test excludes items 1, 2, 5, and 6.



Table 6

Descriptive Test Statistics for the History Test of Content Knowledge (History ILA Part 1)

        % of students           Corrected item-total correlation
Item    answering correctly     Full test     Reduced test^a
1       79.3                     .23          .24
2       48.6                     .32          .33
3       67.1                     .34          .33
4       67.9                     .14          .15
5       43.6                     .27          .27
6       38.3                     .13          .12
7       64.7                     .29          .27
8       56.4                     .18          .19
9       81.6                     .14          .14
10      26.7                     .13          NA

Note. ^a Reduced test excludes item 10.










Table 7

Descriptive Test Statistics for the Biology Test of Reading Comprehension (ILA Part 2)

        % of students           Corrected item-total correlation
Item    answering correctly     Full test     Reduced test^a
1       69.5                     .27          .28
2       56.8                     .23          .24
3       51.3                     .28          .28
4       53.4                     .21          .21
5       50.4                     .29          .31
6       20.1                     .01          NA
7       50.4                     .29          .30
8       71.6                     .31          .32
9       40.0                     .19          .18
10      66.4                     .23          .24

Note. ^a Reduced test excludes item 6.



Table 8

Descriptive Test Statistics for the History Test of Reading Comprehension (History ILA Part 2)

        % of students           Corrected item-total correlation
Item    answering correctly     Full test     Reduced test^a
1       32.7                     .20          .19
2       56.2                     .24          .25
3       20.8                     .09          NA
4       59.6                     .24          .24
5       69.6                     .29          .29
6       72.7                     .32          .32
7       64.1                     .18          .19
8       81.6                     .33          .33
9       57.5                     .19          .20
10      74.6                     .29          .30
11      73.3                     .29          .29
12      41.4                     .28          .27

Note. ^a Reduced test excludes item 3.











The discrimination parameter (or slope) is analogous to the item-test correlations 
presented in Tables 5 through 8; it represents how well the item discriminates between 
individuals who differ on the latent trait. The intercept parameter is related to both the slope 
and the difficulty of an item (i.e., the percentage of students correctly answering a question). 
The guessing parameter accounts for the fact that even individuals with low ability levels 
have some nonzero probability of choosing the correct response. 
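
To make the roles of the three parameters concrete, the sketch below evaluates a 3PL item
response function in slope-intercept form. The report does not give the exact
parameterization or any scaling constant used in estimation, so the functional form and the
parameter values here are assumptions for illustration only.

```python
import math


def p_correct_3pl(theta, slope, intercept, guessing):
    """Probability of a correct response under a 3PL model.

    theta is the latent trait; slope is the discrimination parameter;
    intercept shifts the curve (related to difficulty); guessing is the
    lower asymptote. Slope-intercept form is assumed for illustration.
    """
    logistic = 1.0 / (1.0 + math.exp(-(slope * theta + intercept)))
    return guessing + (1.0 - guessing) * logistic


# Hypothetical four-option item: guessing parameter of 0.25.
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(p_correct_3pl(theta, 1.2, 0.0, 0.25), 3))
```

With these hypothetical values, the probability of a correct response approaches the guessing
parameter for very low trait levels and approaches 1 for very high ones; a larger slope makes
the transition between the two steeper.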

When a confirmatory factor model was fit to the test data with single factors for each 
of the two 10-item tests, there was evidence of bias in the item parameter estimates due to 
the same items that appeared problematic in the descriptive analyses. When these items 
were removed, the resulting parameter estimates were in a more reasonable range. Table 9 
shows estimates of the reliability coefficients $\hat{\rho}^2$ and $\hat{\phi}$ for the full- and reduced-length
tests. The reliability coefficients increase for each test when the problematic items are 
excluded, though in some cases the change is quite small. 

Table 9

Coefficients for Relative and Absolute Decisions for the Biology ILA Tests of Content Knowledge and
Reading Comprehension, Based on Estimated Variance Components.

Measure                                  n items    Relative decisions    Absolute decisions
Biology ILA - Content knowledge
  All items                              10         .44                   .36
  Reduced test^a                         6          .50                   .47
Biology ILA - Reading comprehension
  All items                              10         .65                   .62
  Reduced test^b                         9          .66                   .65
History ILA - Content knowledge
  All items                              10         .46                   .41
  Reduced test^c                         9          .46                   .42
History ILA - Reading comprehension
  All items                              12         .65                   .62
  Reduced test^d                         11         .68                   .67

Note. ^a Omits items 1, 2, 5, 6. ^b Omits item 6. ^c Omits item 10. ^d Omits item 6.



Taken together, the descriptive and IRT analyses suggest that scores from the reduced
tests may be preferable to the full-length tests. In the subsequent section, analyses are
conducted using these shorter versions of the tests, in which the problematic items are
omitted.



Theoretical and Statistical Models of RA Effects 

A possible model for the effects of the RA program is presented in Figure 4. 
Participation in the program is expected to result in certain changes in teachers’ instructional 
practices. These, in turn, may affect how students approach reading. The ILA metacognition 
and reading strategies scores are intended to measure such changes. Both the use of 
particular reading strategies and improved metacognition may contribute to students’ 
reading comprehension and other desired outcomes. Additional measures were used to 
examine variations in instruction. Although the development and properties of these 
measures are beyond the scope of this report, in order to more fully examine the plausibility 
of the model and to aid the interpretation of the student-level variables, some results based 
on their use are presented here. 

Professional Development
          ↓
Instruction
(Teacher Implementation of Curriculum)
          ↓
Utilization of Reading Strategies
(Student Implementation of Curriculum)
          ↓
Student Outcomes
(Reading Comprehension, etc.)

Figure 4. A possible model for the hypothesized effects of the program.

To estimate the effect of treatment group assignment on the student-level variables, we
fit a series of hierarchical linear models to the ILA scores. A multi-level approach is needed
in order to acknowledge the fact that students in the study are not independent; rather, they
are nested within classrooms. All the models follow the same basic structure. The level-1
(student-level) equation relates the observed ILA score ($Y_{ij}$) to a class-level mean
($\beta_{0j}$), plus a residual term ($r_{ij}$):

$$Y_{ij} = \beta_{0j} + r_{ij}, \qquad r_{ij} \sim N(0, \sigma^2)$$

The level-2 (class-level) equation presents the class-level mean ($\beta_{0j}$) as a sum of a
grand mean ($\gamma_0$), the product of the treatment effect ($\gamma_1$) and an indicator of
class-level treatment status ($TREATMENT_j$, a variable with value 0 or 1 for the control and
treatment groups, respectively), and a class-level residual term ($u_j$):

$$\beta_{0j} = \gamma_0 + \gamma_1 (TREATMENT_j) + u_j, \qquad u_j \sim N(0, \tau^2)$$

Linear regression models were used to examine the relationship between treatment
assignment and teacher-level variables. The form of these models is essentially the same as
the level-2 equation in the multi-level models. Specifically, teachers’ implementation scores
($Y_j$) are modeled as the sum of a grand mean ($\beta_0$), the product of the treatment effect
($\beta_1$) and treatment status ($TREATMENT_j$), and a residual term ($e_j$):

$$Y_j = \beta_0 + \beta_1 (TREATMENT_j) + e_j, \qquad e_j \sim N(0, \sigma^2)$$
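
As a small illustration of the teacher-level regression, the sketch below fits the
single-predictor model by ordinary least squares in plain Python. With a 0/1 treatment
indicator, the intercept equals the control-group mean and the slope equals the difference
between the treatment and control means. The function name and the scores are hypothetical;
they are not data from the study.

```python
from statistics import mean


def ols_treatment_effect(scores, treatment):
    """Fit Y = b0 + b1 * TREATMENT by ordinary least squares.

    scores: outcome per teacher; treatment: matching 0/1 indicators.
    Returns (b0, b1).
    """
    x_bar, y_bar = mean(treatment), mean(scores)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(treatment, scores))
    sxx = sum((x - x_bar) ** 2 for x in treatment)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical implementation scores for six teachers.
scores = [0.5, 0.6, 0.4, 0.8, 0.7, 0.9]
treatment = [0, 0, 0, 1, 1, 1]
b0, b1 = ols_treatment_effect(scores, treatment)
print(round(b0, 2), round(b1, 2))  # 0.5 0.3
```

The multi-level models for student outcomes add a class-level random term to this structure
and would normally be fit with dedicated mixed-model software rather than by hand.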

In addition to fitting the various multi-level and regression models, we calculated the 
Pearson correlations between the study variables. To account for the nested data structure, 
student ILA scores were first averaged within classrooms. 
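
The aggregation step can be sketched as follows: student scores are averaged within each
classroom, and Pearson correlations are then computed on the resulting class-level means. The
record layout, class identifiers, and values are hypothetical and serve only to show the
procedure.

```python
from collections import defaultdict
from statistics import mean


def class_means(records):
    """Average student scores within classrooms.

    records: iterable of (class_id, score) pairs.
    Returns a dict mapping class_id to the mean score for that class.
    """
    by_class = defaultdict(list)
    for class_id, score in records:
        by_class[class_id].append(score)
    return {class_id: mean(values) for class_id, values in by_class.items()}


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical student-level scores on two measures, keyed by classroom.
metacognition = [("A", 1.0), ("A", 2.0), ("B", 2.0), ("B", 3.0), ("C", 3.0), ("C", 4.0)]
comprehension = [("A", 4.0), ("A", 5.0), ("B", 5.0), ("B", 7.0), ("C", 8.0), ("C", 8.0)]

mc = class_means(metacognition)
rc = class_means(comprehension)
classes = sorted(mc)
print(pearson([mc[c] for c in classes], [rc[c] for c in classes]))
```

Correlating class means rather than individual scores keeps the unit of analysis aligned with
the level at which treatment was assigned.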

Effects of treatment assignment on teacher instruction variables (literacy instruction, 
content coverage) were estimated via ordinary least squares regression. Effects on student 
implementation variables (metacognition, reading strategies) and other student outcomes 
(content knowledge, reading comprehension, writing content, and writing language) were 
estimated using hierarchical linear models. 







Results 



Estimated Treatment Effects and Correlations Among Study Variables 

Tables 10 and 11 present the results of the analyses described above for the biology
and history samples, respectively. For the biology sample, the estimated effect of assignment
was positive for literacy instruction ($\beta_1 = .20$, p < .01). No effect was observed on the
measure of content coverage. For the history sample, assignment to the RA program had a
positive effect on both literacy instruction ($\beta_1 = .21$, p < .01) and content coverage
($\beta_1 = .10$, p < .05). The positive effects on literacy instruction are consistent with
the intended goals of the RA curriculum. However, it would be possible for that emphasis to
come at the expense of other aspects of the curriculum. Based on this measure of content
coverage, that did not occur: coverage in the treatment group was no lower than in the control
group. Although the estimated effects of assignment on metacognition and reading strategies
were positive for both the biology and history samples, none of these effects were
statistically significant. Estimated effects on other student outcomes varied in direction;
that is, there were some positive and some negative effects. However, these also were not
significant. In sum, there is evidence that assignment to the RA group had a positive effect
on the intended teacher practice, which is literacy instruction. However, no significant
effects were observed on more distal variables.







Table 10

Analyses of Treatment Effects and Correlations Between Study Variables - Biology Sample

                                  Trt effect       Pearson correlations of class level means
Variable                          Est      SE      Trt     LI      OTL     MC      RS      CK      RC      WC      WL      DRP
Instructional practices
  Literacy instruction (LI)       .20**    .06     .45**
  Content coverage (OTL)          .00      .04     .00     -.14
Student reading processes
  Metacognition (MC)              .15      .12     .28     .28     .34*
  Reading strategies (RS)         .03      .20     .13     .25     .21     .47**
Other student outcomes
  Content knowledge (CK)          -.38     .23     -.24    .06     .39*    .40**   .29
  Reading comprehension (RC)      -.38     .32     -.12    .00     .35     .52**   .42**   .69**
  Writing content (WC)            .01      .15     -.12    -.09    .12     .46**   .28     .55**   .65**
  Writing language (WL)           .08      .15     -.04    -.06    .17     .55**   .25     .52**   .66**   .94**
  Degrees of reading power (DRP)  -.77     3.45    .06     .05     .02     .45**   .07     .44**   .48**   .47**   .48**
  Biology CST                     -12.34   12.15   -.56*   -.17    .16     .40*    .09     .79**   .70**   .73**   .72**   .62**

*p<.05; **p<.01









Table 11

Analyses of Treatment Effects and Correlations Between Study Variables - History Sample

                                  Trt effect            Pearson correlations of class level means
Variable                          Est           SE      Trt     LI      OTL     MC      RS      CK      RC      WC      WL
Instructional practices
  Literacy instruction (LI)       .21**         .04     .57**
  Content coverage (OTL)          .10*          .05     .36*
Student reading processes
  Metacognition (MC)              .08           .08     .10     .13     -.09
  Reading strategies (RS)         .42           .22     .23     .26     -.06
Other student outcomes
  Content knowledge (CK)          .39           .26     .18     .23     .20     .26     .24
  Reading comprehension (RC)      -.01          .31     -.07    -.07    -.40    .44**   .30*    .45**
  Writing content (WC)            .12           .14     .09     .18     -.12    .46**   .43**   .57**   .68**
  Writing language (WL)           .10           .13     .06     .16     -.09    .39**   .35*    .54**   .67**   .94**
  Degrees of reading power (DRP)  [illegible]   3.11    -.14    -.08    -.22    .31*    .20     .32*    .71**   .58**   .63**

*p<.05; **p<.01









There are, of course, many reasons that such effects might not be observed. Perhaps 
the most straightforward interpretation is that the proposed model (Figure 4) is incorrect. 
Specifically, while participation in RA professional development may lead to enhanced 
literacy instruction, this change in instruction may not affect student reading processes or 
other student outcomes. An alternative to this conclusion might be that it is too soon to 
observe any effects on student outcomes. From this perspective, changes in student variables 
may indeed be related to instruction (and so the model may be generally correct). However, 
those changes could take longer to develop and perhaps had not occurred when the ILA was 
administered. The correlations between the class-level mean scores for these variables are at 
least suggestive of positive relationships between the steps in the hypothesized model. The 
correlations between variables that are adjacent in Figure 4 are shaded in gray in Tables 10 
and 11. Given these apparent positive relationships, another possibility is that effects of 
treatment assignment on the student variables are attenuated by multiplication of modest 
stepwise effects. 
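
As a rough illustration of this attenuation argument, consider a simple causal chain in which
each step of Figure 4 is summarized by a single standardized effect. Under that simplifying
assumption (it is not a model fit in this report), the effect of treatment assignment on a
distal outcome is approximately the product of the stepwise effects. The values below are
hypothetical and chosen only to show the arithmetic.

```latex
\text{effect on distal outcome} \approx a \times b \times c
  = 0.4 \times 0.3 \times 0.4 = 0.048
```

Even when every individual step is moderate, the implied end-to-end effect is small enough to
be hard to detect with a modest number of classrooms.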

[Figure 5: boxplots of the two teacher implementation variables for the Control and Treatment
groups; vertical axis runs from 0.0 to 1.0.]

Figure 5. Teacher implementation variables - Biology Sample. Literacy instruction is shown in
purple; content coverage is shown in blue.

In addition, heterogeneity in implementation of the RA curriculum by teachers and 
utilization of RA strategies by students may further reduce the power of the study to detect 
overall effects of treatment assignment. As an example, Figure 5 presents boxplots of the 
biology teacher implementation variables for the two study groups. A similar pattern was 
observed among history teachers. As described previously, the average scores for content 
coverage are similar across groups, while the level of literacy instruction is generally higher 
in the treatment group. That said, there is substantial variability within the groups; in fact, 
the treatment and control groups display a substantial amount of overlap. As a consequence, 
it appears that some students in control classrooms may have been exposed to levels of








